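The CUDA execution model turns your computer into a high-performance heterogeneous system. Picture a commander-in-chief (the Host, or CPU) and an army of thousands (the Device, or GPU). The commander handles complex logic and decisions, while the army carries out large numbers of repetitive tasks at the same time.
1. Architectural Differences
The Host is a low-latency CPU optimized for complex control flow and serial tasks. The Device, by contrast, is a GPU designed for high throughput: it contains thousands of simple cores that can execute the same instruction across a very large data set simultaneously.
2. Execution Rhythm
A CUDA program runs as a series of phases. Execution begins on the Host, which handles the serial code. When the program reaches a parallel kernel, it launches a grid of threads on the Device. Once the Device has finished its massively parallel work, control returns to the Host.
3. Performance Specialization
The model plays to the strengths of both processors: the CPU manages system resources and complex branching, while the GPU runs SPMD (Single Program, Multiple Data) logic, processing data elements in parallel.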
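The three phases and the SPMD idea can be sketched in a small CUDA C++ program. This is a minimal illustration rather than code from the lesson: the kernel name `addVectors`, the helpers `blocksFor` and `check`, the array size, and the choice of 256 threads per block are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// SPMD: every thread runs this same function; only the index i differs.
__global__ void addVectors(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // the last block may have spare threads
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: number of blocks needed to cover n elements.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: stop with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    // Phase 1 (Host, serial): prepare the input data.
    const int n = 1 << 20;                  // assumed problem size
    const size_t size = n * sizeof(float);
    float *a = (float *)malloc(size);
    float *b = (float *)malloc(size);
    float *c = (float *)malloc(size);
    for (int i = 0; i < n; ++i) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    float *ad, *bd, *cd;                    // device pointers
    check(cudaMalloc((void **)&ad, size), "cudaMalloc ad");
    check(cudaMalloc((void **)&bd, size), "cudaMalloc bd");
    check(cudaMalloc((void **)&cd, size), "cudaMalloc cd");
    check(cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice), "copy b");

    // Phase 2 (Device, parallel): launch a grid of threads.
    const int threadsPerBlock = 256;        // assumed block size
    addVectors<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(ad, bd, cd, n);
    check(cudaGetLastError(), "kernel launch");

    // Phase 3 (Host, serial): this copy waits for the kernel to finish,
    // after which the Host continues with the results.
    check(cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost), "copy c");
    printf("c[0] = %.1f, c[n-1] = %.1f\n", c[0], c[n - 1]);

    cudaFree(ad);
    cudaFree(bd);
    cudaFree(cd);
    free(a);
    free(b);
    free(c);
    return 0;
}
```

Rounding the block count up and guarding with `if (i < n)` keeps the sketch correct when `n` is not a multiple of the block size.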
QUESTION 1
Which architecture is characterized as being 'throughput-optimized'?
The Host (Intel® CPU)
The Device (NVIDIA® GPU)
The System RAM
The PCIe Bus
✅ Correct!
Correct! GPUs are designed to maximize the total amount of work (throughput) done per unit of time by processing thousands of data points simultaneously.
❌ Incorrect
The Host (CPU) is 'latency-optimized' to minimize the time a single thread takes to execute.
QUESTION 2
Part 1 of the MatrixMultiplication() example in Figure 3.6 must be completed with declarations of an Nd and a Pd pointer variable and their corresponding cudaMalloc() calls, and Part 3 with the mandatory calls that release that memory. Which code does this correctly?
float *Nd, *Pd; cudaMalloc((void**)&Nd, size); ... cudaFree(Nd);
float Nd, Pd; malloc(&Nd, size); ... free(Nd);
float *Nd, *Pd; cudaMemcpy(Nd, Pd, size); ... delete Nd;
int Nd, Pd; Nd = new float[size]; ... free(Nd);
✅ Correct!
Exactly. You must declare pointers for the device, use cudaMalloc with a double-pointer cast, and use cudaFree to release the memory.
❌ Incorrect
Standard C malloc/free or C++ new/delete cannot be used to manage Device (GPU) memory.
QUESTION 3
In the CUDA execution model, where does a program always begin its execution?
On the Device (GPU)
Simultaneously on both
On the Host (CPU)
In the Global Memory
✅ Correct!
Correct. Execution starts with the serial code on the Host (CPU).
❌ Incorrect
The GPU only begins work when a Kernel is specifically launched by the Host.
QUESTION 4
What happens when the Host encounters a phase with rich data parallelism?
It speeds up its clock frequency.
It launches a Kernel onto the Device.
It stores the data in the Host Cache.
It converts the code to Python.
✅ Correct!
Yes! The Host 'offloads' the parallel work by launching a kernel on the massive core array of the GPU.
❌ Incorrect
The CPU is not optimized for massive data parallelism; it offloads such work to the Device.
QUESTION 5
A student attempts to launch a 1024x1024 matrix multiplication on G80 hardware using 1024 blocks, where each thread calculates one element. Why will this fail?
The G80 cannot handle 1024 blocks.
The total number of threads exceeds 1 million.
The configuration results in 1024 threads per block, exceeding the 512 hardware limit.
Matrix multiplication is not data parallel.
✅ Correct!
Precisely. 1,048,576 elements divided by 1024 blocks results in 1024 threads per block, which exceeds the G80 architecture limit of 512.
❌ Incorrect
Check the thread-per-block limit for the G80 architecture: it is 512.
Case Study: High-Resolution Fluid Dynamics
Optimizing a Heterogeneous Simulation
You are developing a fluid dynamics engine. The simulation involves: (A) Calculating the user interface and file logging, (B) Computing the pressure gradients for 20 million fluid cells, and (C) Updating the simulation time-step based on global convergence tests. You must decide how to map these tasks to the CUDA execution model.
Q
1. Which tasks (A, B, or C) should definitely remain on the Host, and why?
Solution:
Task A (UI and Logging) and Task C (Time-step logic) should remain on the Host. These tasks are serial in nature, involve complex I/O and control logic, and do not benefit from throughput optimization. The Host is designed to minimize the latency of these single-threaded tasks.
Q
2. How does the 'alternating phases' concept apply to the interaction between tasks B and C?
Solution:
The program enters a loop where the Host launches Task B (the parallel pressure kernel) on the Device. Once Task B completes (synchronization), control returns to the Host to perform Task C (serial convergence check). This repeats for every time-step in the simulation.
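The alternating loop between Task B and Task C might look roughly like the sketch below. The case study does not specify a numerical method, so the kernel `relaxStep` (a simple 1D smoothing step), the host function `hasConverged`, the tolerance, and the small array size are all hypothetical stand-ins chosen only to show the control flow. Error checking is omitted for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Stand-in for Task B (hypothetical): one smoothing step over a 1D field.
// Each thread updates one cell and records how much that cell changed.
__global__ void relaxStep(const float *cur, float *next, float *change, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Boundary cells keep their value; interior cells average their neighbours.
    float v = (i == 0 || i == n - 1) ? cur[i]
                                     : 0.5f * (cur[i - 1] + cur[i + 1]);
    next[i] = v;
    change[i] = fabsf(v - cur[i]);
}

// Stand-in for Task C (hypothetical): serial convergence test on the Host.
bool hasConverged(const std::vector<float> &change, float tolerance)
{
    float maxChange = 0.0f;
    for (float c : change) {
        maxChange = std::fmax(maxChange, c);
    }
    return maxChange < tolerance;
}

int main()
{
    const int n = 1024;                      // far smaller than 20 million cells
    const size_t size = n * sizeof(float);
    std::vector<float> field(n, 0.0f), change(n);
    field[n - 1] = 1.0f;                     // assumed fixed boundary value

    float *curD, *nextD, *changeD;
    cudaMalloc((void **)&curD, size);
    cudaMalloc((void **)&nextD, size);
    cudaMalloc((void **)&changeD, size);
    cudaMemcpy(curD, field.data(), size, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const int maxSteps = 1000;               // assumed safety limit
    int launches = 0;

    for (int step = 0; step < maxSteps; ++step) {
        // Parallel phase: the Host launches Task B on the Device.
        relaxStep<<<blocks, threadsPerBlock>>>(curD, nextD, changeD, n);
        ++launches;

        // This copy waits for the kernel, so it doubles as the sync point.
        cudaMemcpy(change.data(), changeD, size, cudaMemcpyDeviceToHost);
        std::swap(curD, nextD);              // new values become the input

        // Serial phase: the Host performs Task C and decides whether to go on.
        if (hasConverged(change, 1e-4f)) break;
    }

    printf("stopped after %d kernel launches\n", launches);

    cudaFree(curD);
    cudaFree(nextD);
    cudaFree(changeD);
    return 0;
}
```

Copying the whole `change` array back on every step keeps the convergence test serial, as the case study describes, but it would be costly at 20 million cells.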